Implement the new tuning API for DeviceRleDispatch #7669

Merged
bernhardmgruber merged 7 commits into NVIDIA:main from bernhardmgruber:tuning_rle_non_trivial
Feb 23, 2026

Conversation

@bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Feb 13, 2026

Depends on:

Fixes: #7532

@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@bernhardmgruber
Contributor Author

There are substantial SASS diffs, for example in

Function : void cub::_V_300400_SM_750_800_860_900_1000::detail::rle::DeviceRleSweepKernel<cub::_V_300400_SM_750_800_860_900_1000::detail::rle::non_trivial_runs::policy_selector_from_types<long, cuda::std::__4::complex<float>>, cuda::std::__4::complex<float> const*, long*, long*, long*, cub::_V_300400_SM_750_800_860_900_1000::ReduceByKeyScanTileState<long, int, true>, cuda::std::__4::equal_to<void>, int, long, cub::_V_300400_SM_750_800_860_900_1000::detail::rle::streaming_context<cuda::std::__4::complex<float> const*, long, long>>(cuda::std::__4::complex<float> const*, long*, long*, long*, cub::_V_300400_SM_750_800_860_900_1000::ReduceByKeyScanTileState<long, int, true>, cuda::std::__4::equal_to<void>, int, int, cub::_V_300400_SM_750_800_860_900_1000::detail::rle::streaming_context<cuda::std::__4::complex<float> const*, long, long>)

on SM75.

@bernhardmgruber bernhardmgruber marked this pull request as ready for review February 22, 2026 12:55
@bernhardmgruber bernhardmgruber requested review from a team as code owners February 22, 2026 12:55
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Feb 22, 2026
@bernhardmgruber
Contributor Author

There are substantial SASS diffs, for example in [...] on SM75.

With the change from #7733, I asked Cursor:

The last two commits lead to SASS differences on the benchmark cub.bench.run_length_encode.non_trivial_runs.base for SM75. This was likely caused by the rewrite of the tuning information and dispatch layer of DeviceRleDispatch. Please find out why the SASS changed and fix the new tuning and dispatch code so the SASS change is gone. You are not allowed to revert to the old code.

and it indeed found the root cause. I am impressed.

Comment on lines +523 to +524
// TODO(bgruber): I think we want `LengthT` instead of `int`
return make_default_policy(BLOCK_LOAD_DIRECT, sizeof(int), LOAD_LDG);
Contributor Author

Retaining comment on old code
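
The TODO in the reviewed lines is about which type's size gets passed to `make_default_policy`. As a rough illustration of why that choice could matter, the sketch below assumes the size argument feeds a scaling of a nominal 4-byte items-per-thread value, a pattern commonly seen in CUB tunings. The helper `scaled_items_per_thread` and the nominal value 15 are hypothetical and not taken from this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of CUB: scales a tuning value chosen for
// 4-byte elements to an element of `size` bytes, clamped to [1, nominal].
constexpr int scaled_items_per_thread(int nominal_4b_items, std::size_t size)
{
  return std::min(nominal_4b_items,
                  std::max(1, nominal_4b_items * 4 / static_cast<int>(size)));
}

// Assumed nominal value, for illustration only.
constexpr int nominal_items = 15;

// A 4-byte length type keeps the nominal value.
static_assert(scaled_items_per_thread(nominal_items, sizeof(std::int32_t)) == 15,
              "4-byte elements keep the nominal items per thread");

// An 8-byte length type roughly halves it (integer division: 60 / 8 == 7).
static_assert(scaled_items_per_thread(nominal_items, sizeof(std::int64_t)) == 7,
              "8-byte elements get fewer items per thread");
```

Under this assumption, passing `sizeof(int)` when the length type is 8 bytes wide would select a different items-per-thread value than passing `sizeof(LengthT)`. Whether the real `make_default_policy` uses its size argument this way is not confirmed by the conversation.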

@bernhardmgruber bernhardmgruber enabled auto-merge (squash) February 23, 2026 20:26
@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 21h 20m: Pass: 100%/249 | Total: 9d 05h | Max: 3h 57m | Hits: 71%/153868

See results here.

@bernhardmgruber bernhardmgruber merged commit c3161ff into NVIDIA:main Feb 23, 2026
527 of 531 checks passed
@bernhardmgruber bernhardmgruber deleted the tuning_rle_non_trivial branch February 23, 2026 21:04